perf: improve HunyuanVideo1.5 I2V runtime and VAE decode controls #1201
starrkk wants to merge 5 commits into
(cherry picked from commit d60b8f32c7787054faba8fbacaf5c38fac3ffbfb)
(cherry picked from commit e8ee93a79bd20dce2d084e992a8e140710f2c9b6)
(cherry picked from commit b066001a517b59e5ddbf8f7dcce4a14a017be46d)
Code Review
This pull request introduces several enhancements and compatibility fixes across the repository, including conditional pipeline imports, backward-compatible unpadding for attention layers, VAE post-processing utilities (such as cropping and rank-0 post-processing skips), detailed timing logs, and fallback support for 4D tensors in SDPA. The code review identified three issues: a possible runtime AttributeError from the is_cpu tensor attribute on PyTorch versions that do not provide it, a critical layout detection bug in _spatial_dims when the frame count is 16 or 32, and a potential AttributeError when accessing seq_p_group directly.
| if cu_seqlens_q is not None and cu_seqlens_q.is_cpu: | ||
| cu_seqlens_q = cu_seqlens_q.to(q_flat.device, non_blocking=True) | ||
| if cu_seqlens_kv is not None and cu_seqlens_kv.is_cpu: | ||
| cu_seqlens_kv = cu_seqlens_kv.to(k_flat.device, non_blocking=True) |
torch.Tensor.is_cpu is available in recent PyTorch releases, but older releases may lack it; where it is missing, accessing cu_seqlens_q.is_cpu or cu_seqlens_kv.is_cpu raises an AttributeError at runtime.
Checking the device type with .device.type == "cpu" works regardless of the PyTorch version.
| if cu_seqlens_q is not None and cu_seqlens_q.device.type == "cpu": | |
| cu_seqlens_q = cu_seqlens_q.to(q_flat.device, non_blocking=True) | |
| if cu_seqlens_kv is not None and cu_seqlens_kv.device.type == "cpu": | |
| cu_seqlens_kv = cu_seqlens_kv.to(k_flat.device, non_blocking=True) |
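The device check in the suggestion can be factored into a small helper. This is a minimal sketch rather than code from the PR: the name is_on_cpu is hypothetical, and the demo uses a stand-in object with a .device.type attribute so that it runs without PyTorch installed. A real torch.Tensor exposes the same attribute path.

```python
from types import SimpleNamespace


def is_on_cpu(tensor_like):
    """Return True if the object reports a CPU device.

    Hypothetical helper: relies only on `.device.type`, which torch.Tensor
    provides, so it does not depend on `Tensor.is_cpu` being available.
    """
    return tensor_like.device.type == "cpu"


# Stand-ins for tensors (assumption: only `.device.type` is needed here).
cpu_like = SimpleNamespace(device=SimpleNamespace(type="cpu"))
gpu_like = SimpleNamespace(device=SimpleNamespace(type="cuda"))

cpu_result = is_on_cpu(cpu_like)
gpu_result = is_on_cpu(gpu_like)
print(cpu_result, gpu_result)
```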
| def _spatial_dims(video): | ||
| if video.ndim != 5: | ||
| return -2, -1 | ||
| # VAE tensors are usually B,C,T,H,W before wan_vae_to_comfy and | ||
| # B,T,H,W,C afterwards. In both layouts, H/W are the two dims before | ||
| # channels only for the postprocessed form; rank0 crop is done before it. | ||
| if video.shape[1] in (1, 3, 4, 16, 32): | ||
| return 3, 4 | ||
| return 2, 3 |
There is a critical layout detection bug in _spatial_dims.
If the video tensor is in B,T,H,W,C layout (postprocessed form) and the number of frames T is 16 or 32 (which are extremely common video lengths), video.shape[1] will be 16 or 32.
Since 16 and 32 are in (1, 3, 4, 16, 32), the function will return 3, 4 (treating W and C as spatial dimensions) instead of 2, 3 (treating H and W as spatial dimensions). This will cause incorrect cropping of the channel dimension and corrupt the output.
To make this robust, we can check if the last dimension is the channel dimension (typically 1, 3, 4, 16, 32) and is smaller than or equal to the second dimension (T).
| def _spatial_dims(video): | |
| if video.ndim != 5: | |
| return -2, -1 | |
| # B,T,H,W,C layout: last dimension is channels (typically 1, 3, 4, 16, 32) | |
| # and is smaller than or equal to the second dimension (T). | |
| if video.shape[-1] in (1, 3, 4, 16, 32) and video.shape[1] >= video.shape[-1]: | |
| return 2, 3 | |
| # B,C,T,H,W layout: second dimension is channels | |
| if video.shape[1] in (1, 3, 4, 16, 32): | |
| return 3, 4 | |
| return 2, 3 |
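The misdetection described above can be reproduced on plain shape tuples, without any tensors. The sketch below mirrors the two versions of _spatial_dims quoted in this review; the function names, the CHANNEL_SIZES constant, and the example shapes are illustrative assumptions, not values taken from the PR.

```python
CHANNEL_SIZES = (1, 3, 4, 16, 32)


def spatial_dims_original(shape):
    """Logic of the PR version, applied to a shape tuple."""
    if len(shape) != 5:
        return -2, -1
    if shape[1] in CHANNEL_SIZES:
        return 3, 4
    return 2, 3


def spatial_dims_suggested(shape):
    """Logic of the reviewer's suggestion, applied to a shape tuple."""
    if len(shape) != 5:
        return -2, -1
    # B,T,H,W,C: last dim looks like channels and is no larger than T.
    if shape[-1] in CHANNEL_SIZES and shape[1] >= shape[-1]:
        return 2, 3
    # B,C,T,H,W: second dim looks like channels.
    if shape[1] in CHANNEL_SIZES:
        return 3, 4
    return 2, 3


# Illustrative shapes (assumed for the demo).
postprocessed = (1, 16, 480, 832, 3)   # B,T,H,W,C with T = 16
latent_like = (1, 16, 4, 60, 104)      # B,C,T,H,W with C = 16
single_frame = (1, 1, 480, 832, 3)     # B,T,H,W,C with T = 1

print(spatial_dims_original(postprocessed))   # (3, 4): W and C treated as spatial
print(spatial_dims_suggested(postprocessed))  # (2, 3): H and W
print(spatial_dims_suggested(latent_like))    # (3, 4): H and W
print(spatial_dims_suggested(single_frame))   # (3, 4): still misdetected
```

Both versions remain shape heuristics. As the last case shows, a single-frame B,T,H,W,C tensor with three channels still resolves to (3, 4) under the suggested check because T is smaller than C, so passing the layout explicitly from the call site may be the more robust option.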
@starrkk In addition, please minimize the use of environment variables as switches and use JSON-formatted configuration files whenever possible. Also, please pay attention to the code format.
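On the configuration point, one way to read a switch from a JSON file instead of an environment variable is sketched below. The section name vae_decode, the keys crop_output and log_timing, and the helper load_switches are all hypothetical; they are not taken from this PR or from the project's configuration schema.

```python
import json
import os
import tempfile


def load_switches(path):
    """Load feature switches from a JSON configuration file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Hypothetical config content, written to a temporary file for the demo.
config_text = json.dumps({"vae_decode": {"crop_output": True, "log_timing": False}})

with tempfile.TemporaryDirectory() as tmp:
    config_path = os.path.join(tmp, "config.json")
    with open(config_path, "w", encoding="utf-8") as f:
        f.write(config_text)
    config = load_switches(config_path)

# Fall back to a default when the section or key is absent.
crop_output = config.get("vae_decode", {}).get("crop_output", False)
log_timing = config.get("vae_decode", {}).get("log_timing", False)
print(crop_output, log_timing)
```

Compared with environment variables, the switches live in one reviewable file and carry real JSON types, so no string parsing of values such as "1" or "true" is needed.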
Summary
Why
This groups the HunyuanVideo1.5 I2V runtime changes that were validated together for 8-card Hygon DCU inference. This is intentionally opened as a draft because it is broader than the smaller PRs and may be easier to review after splitting further.
Validation
ModelTC/LightX2V:main (89dfa833)
git diff --check passed for the PR branch